Information Theory of DNA Shotgun Sequencing 

Abolfazl Motahari, Guy Bresler and David Tse 

Department of Electrical Engineering and Computer Sciences 
en 

(3 ■ University of California, Berkeley 

(N 

XJ ! {motahari,gbresler,dtse}@eecs. berkeley.edu 

D ' 

^: 

Abstract 

DNA sequencing is the basic workhorse of modern day biology and medicine. Shot- 
c/3 ' gun sequencing is the dominant technique used: many randomly located short frag- 

ments called reads are extracted from the DNA sequence, and these reads are assembled 
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\^ I required for reliable reconstruction? This number provides a fundamental limit to the 

cn . performance of any assembly algorithm. For a simple statistical model of the DNA 

Cn I sequence and the read process, we show that the answer admits a critical phenomena 

in the asymptotic limit of long DNA sequences: if the read length is below a thresh- 

^ I old, reconstruction is impossible no matter how many reads are observed, and if the 

H ■ 
_Cy_' read length is above the threshold, having enough reads to cover the DNA sequence 

is sufficient to reconstruct. The threshold is computed in terms of the Renyi entropy 

rate of the DNA sequence. We also study the impact of noise in the read process on 

the performance. 



to reconstruct the original sequence. A basic question is: given a sequencing technol- 
ogy and the statistics of the DNA sequence, what is the minimum number of reads 



ACGTCCTATGCGTATGCGTAATGCCACATATTGCTATGGTAATCGCTGCATATC 

genome length G ^ 10^ 



A^ reads 



read length L ^ 100 iV ~ 10^ 

Figure 1: Schematic for shotgun sequencing. 

1 Introduction 

1.1 Background and Motivation 

DNA sequencing is the basic workhorse of modern day biology and medicine. Since the se- 
quencing of the Human Reference Genome ten years ago, there has been an explosive advance 
in sequencing technology, resulting in several orders of magnitude increase in throughput and 
decrease in cost. This advance allows the generation of a massive amount of data, enabling 
the exploration of a diverse set of questions in biology and medicine that were beyond reach 
even several years ago. These questions include discovering genetic variations across different 
humans (such as single-nucleotide polymorphisms SNPs), identifying genes affected by mu- 
tation in cancer tissue genomes, sequencing an individual's genome for diagnosis (personal 
genomics), and understanding DNA regulation in different body tissues. 

Shotgun sequencing is the dominant method currently used to sequence long strands of 
DNA, including entire genomes. The basic shotgun DNA sequencing set-up is shown in 
Figure [H Starting with a DNA molecule, the goal is to obtain the sequence of nucleotides 
{A, C, G or T) comprising it. (For humans, the DNA sequence has about 3 x 10^ nucleotides, 
or base pairs.) The sequencing machine extracts a large number of reads from the DNA; 
each read is a randomly located fragment of the DNA sequence, of lengths of the order of 
100-1000 base pairs, depending on the sequencing technology. The number of reads can be 
of the order of lO's to lOO's of millions. The DNA assembly problem is to reconstruct the 
DNA sequence from the many reads. 



When the human genome was sequenced in 2001, there was only one sequencing technol- 
ogy, the Sanger platform [23]. Since 2005, there has been a proliferation of "next generation" 
platforms, including Roche/454, Life Technologies SOLID, Illumina Hi-Seq 2000 and Pacific 
Biosciences RS. Compared to the Sanger platform, these technologies can provide massively 
parallel sequencing, producing far more reads per instrument run and at a lower cost, al- 
though the reads are shorter in lengths. Each of these technologies generates reads of different 
lengths and with different noise profiles. For example, the 454 machines have read lengths of 
about 400 base pairs, while the SOLID machines have read lengths of about 100 base pairs. 
At the same time, there has been a proliferation of a large number of assembly algorithms, 
many tailored to specific sequencing technologies. (Recent survey articles [201 HEl IS] discuss 
no less than 20 such algorithms, and the Wikipedia entry on this topic listed 42 [50].) 

The design of these algorithms is based primarily on computational considerations. The 
goal is to design efficient algorithms that can scale well with the large amount of sequencing 
data. Current algorithms are often tailored to particular machines and are designed based 
on heuristics and domain knowledge regarding the specific DNA being sequenced; this makes 
it difficult to compare different algorithms, not to mention to define what it means by an 
"optimal" assembly algorithm for a given sequencing problem. One reason for the heuristic 
approach taken towards the problem is that various formulations of the assembly problem 
are known to be NP-hard (see for example [TT] ) 

An alternative to the computational view is the information theoretic view. In this view, 
the genome sequence is regarded as a random string to be estimated based on the read data. 
The basic question is: what is the minimum number of reads needed to reconstruct the 
DNA sequence with a given reliability? This minimum number can be used as a benchmark 
to compare different algorithms, and an optimal algorithm is one that achieves this mini- 
mum number. It can also provide an algorithm- independent basis for comparing different 
sequencing technologies and for designing new technologies. 

This information theoretic view falls in the realm of DNA sequencing theory [29J. A well- 
known lower bound on the number of reads needed can be obtained by a coverage analysis, 
an approach pioneered by Lander and Waterman [12]. This lower bound is the number of 
reads A^cov such that with a desired probability, say 1 — e, the randomly located reads cover 



the entire genome sequence. The number A^cov can be easily approximated: 

iVeov(e,G,L)^^ln(-^), 

where G and L are DNA and read length, respectively. While this is clearly a lower bound on 
the minimum number of reads needed, it is in general not tight: only requiring the reads to 
cover the entire genome sequence does not guarantee that consecutive reads can actually be 
stitched back together to recover the entire sequence. The ability to do that depends on other 
factors such as the repeat statistics of the DNA sequence and also the noise profile in the 
read process. Thus, characterizing the minimum number of reads required for reconstruction 
is in general an open question. 

1.2 Main contributions 

In this paper, we make progress on this basic problem. We first focus on a very simple model: 

1. the DNA sequence is modeled as an i.i.d. random process of length G with each symbol 
taking values according to a probability distribution p on the alphabet {A, C, G, T}. 

2. each read is of length L symbols and begins at a uniformly distributed location on the 
DNA sequence and the locations are independent from one read to another. 

3. the read process is noiseless. 

Fix an e G (0, 1/2) and let A^min(e) G*, L) be the minimum number of reads required to 



reconstruct the DNA with probability at least 1 — e. We would like to know how ^ {ec D 



Nmin{e,G,L) 

Ncov{e,G,L) 

behaves in the asymptotic regime when G and L grow to infinity. It turns out that in this 
regime, the ratio depends on G and L through a normalized parameter: 



logG' 
where log(-) represents logarithms to base 2. We define 

N^,^[e,G,L) 



CminiL) = lim. 



G-5>oo,L=LlogG A^cov(e, G, L) 



^2(p) 



Figure 2: The critical phenomenon. 



Let H2{p) be the Renyi entropy of order 2, defined to be 

H,ip):=-\ogY.pl 



(1) 



Our main result, Theorem[T], yields a critical phenomenon: when L is below the threshold 
2/H2{p), reconstruction is impossible, i.e. Cmin{L) = oo, but when L is above that threshold, 
the obvious necessary condition of coverage is also sufficient for reconstruction, i.e. Cinin(-^) = 
1. The significance of the threshold is that when L < 2/H2{p), with high probability there 
are many repeats of length L in the DNA sequence, while when L > 2/H2{p), with high 
probability there are no repeats of length L. Thus, another way to interpret the result is 
that L < 2/H2{p) is a repeat-limited regime while L > 2/H2{p) is a coverage-limited regime. 
The result is summarized in Figure |21 

A standard measure of data requirements in DNA sequencing projects is the coverage 
depth: the average number of reads covering each base pair. Thus, Ncov{^,G,L) x L/G is 
the coverage depth required to cover the DNA sequence with probability 1 — e (as predicted 
by Lander- Waterman) , and A^min(e, C, L) x L/G is the minimum coverage depth required 
to reconstruct the DNA sequence with probability 1 — e. Hence, c^i^{L) can be interpreted 
as the (asymptotic) normalized minimum coverage depth required to reconstruct the DNA 
sequence. 

In a related work, Arratia et al [2] showed that L > 2/7^2 (p) is a necessary and sufficient 
condition for reconstruction of the i.i.d. DNA sequence if all length L subsequences of 
the DNA sequence are given as reads. This arises in a technology called sequencing by 



hybridization. Obviously, for the same read length L, having all length L subsequences 
provides more information than any number of reads from shotgun sequencing, where the 
reads are randomly sampled. Hence, it follows that L > 2/H2{p) is also a necessary condition 
for shotgun sequencing. What our result says is that this condition together with coverage 
is sufficient for reconstruction asymptotically. 

The basic model of i.i.d. DNA sequence and noiseless reads is very simplistic. We provide 
two extensions to our basic result: 1) Markov DNA statistics; 2) noisy reads. In the first 
case, we show that the same result as the i.i.d. case holds except that the Renyi entropy 
H2{p) is replaced by the Renyi entropy rate of the Markov process. In the second case, we 
analyze the performance of a modification of the greedy algorithm to deal with noisy reads, 
and show that the effect of noise is on increasing the threshold on the read length below 
which reconstruction is impossible. 

Even with these extensions, our models still miss several important aspects of real DNA 
and read data. Perhaps the most important aspect is the presence of long repeats in the 
DNA sequences of many organisms, ranging from bacteria to humans. These long repeats 
are poorly captured by i.i.d. or even Markov models due to their short-range correlation. 
Another aspect is the non-uniformity of the sampling of reads from the DNA sequences. At 
the end of the paper, we will discuss how our results can be used as a foundation to tackle 
these and other issues. 

1.3 Related Work 

Li [13] has also posed the question of minimum number of reads for the i.i.d. equiprobable 
DNA sequence model. He showed that if L > 41og(j, then the number of reads needed 
is 0{G/LlnG), i.e. a constant multiple of the number needed for coverage. Specializing 
to the equiprobable case, our result shows that reconstruction is possible with probability 
1 — e if and only if L > logG and the number of reads is G/Lln{G/Le). Not only is our 
characterization necessary and sufficient, we have a much weaker condition on the read length 
L, and we get the correct pre-log constant on the number of reads needed. As will be seen 
later, many different algorithms have the same scaling behavior in the number of reads they 
need, but it is the pre-log constant which distinguishes them. 
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A common formulation of DNA assembly is the shortest common superstring (SCS) 
problem. The SCS problem is the problem of finding the shortest string containing a set 
of strings, where in the DNA assembly context, the given strings are the reads and the 
superstring is the estimate of the original DNA sequence. While the general SCS problem 
with arbitrary instances is NP-hard [TT], the greedy algorithm has been shown to be optimal 
for the SCS problem under certain probabilistic settings |7l [H]. Thus, the reader may 
have the impression that our results overlap with these previous works. However, there are 
significant differences. 

First, at a basic problem formulation level, the SCS problem and the DNA sequence 
reconstruction problem are not equivalent: there is no guarantee that the shortest common 
superstring containing the given reads is the original DNA sequence. Indeed, it has already 
been observed in the assembly literature (eg. [IS]) that the shortest common superstring 
of the reads may be a significant compression of the original DNA sequence, especially 
when the latter has a lot of repeats, since finding the shortest common superstring tends 
to merge these repeats. For example, in the case of very short reads the resulting shortest 
common superstring is definitely not the original DNA sequence. In contrast, we formulate 
the problem directly in terms of reconstructing the original sequence, and a lower bound on 
the required read length emerges as part of the result. 

Second, even if we assume that the shortest common superstring containing the reads is 
the original DNA sequence, one cannot recover our result from either [2] or [7], for different 
reasons. The main result (Theorem 1) in [14J says that if one models the DNA sequence as 
an arbitrary sequence perturbed by mutating each symbol independently with probability 
p and the reads are arbitrarily located, the average length of the sequence output by the 
greedy algorithm is no more than a factor of 1 + 35 of the length of the shortest common 
superstring, provided that p > 2\og{GL)/{SL), i.e. p > 2/{SL). However, since p < I, the 
condition on p in their theorem implies that ^ > f- Thus, for a fixed L they actually only 
showed that the greedy algorithm is approximately optimal to within a factor of 1 + Q/L, 
and optimal only under the further condition that L — )■ oo. In contrast, our result shows 
that the greedy algorithm is optimal for any L > 2/H2{p), albeit under a weaker model for 
the DNA sequence (i.i.d. or Markov) and read locations (uniform random). 



Regarding [7J, the probabilistic model they used does not capture the essence of the DNA 
sequencing problem. In their model, the given reads are all independently distributed and not 
from a single "mother" sequence, as in our model. In contrast, in our model, even though the 
original DNA sequence is assumed to be i.i.d., the reads will be highly correlated, since many 
of the reads will be physically overlapping. In fact, it follows from [7] that, given N reads and 
the read length L scaling like log A^, the length of the shortest common superstring scales like 
A^ log A^. On the other hand, in our model, the length of the reconstructed sequence would be 
proportional to A^. Hence, the length of the shortest common superstring is much longer for 
the model studied in [^, a consequence of the reads being independent and therefore much 
harder to merge. So the two problems are completely different, although coincident ally the 
greedy algorithm is optimal for both problems. 

1.4 Notations and Outline 

A brief remark on notation is in order. Sets (and probabilistic events) are denoted by 
calligraphic type, e.g. A,B,£, vectors by boldface, e.g. s, x, y, and random variables by 
capital letters such as S, X, Y. Random vectors are denoted by capital boldface, such as 
S, X, Y. The exception to these rules, for the sake of consistency with the literature, are the 
(non-random) parameters G, N, and L. The natural logarithm is denoted by ln( ■ ) and the 
base 2 logarithm by log( ■ ). 

The rest of the paper is organized as follows. Section 12.11 gives the precise formulation 
of the problem. Section 12.21 explains why reconstruction is impossible for read length below 
the stated threshold. For read length above the threshold, an optimal algorithm is presented 
in Section 12. 3[ where a heuristic argument is given to explain why it performs optimally. 
Sections [3] and S] describe extensions of our basic result to incorporate read noise and a more 
complex model for DNA statistics, respectively. Section discusses future work. Appendices 
contain the formal proofs of all the results in the paper. 




Figure 3: A circular DNA sequence which is sampled randomly. 

2 I.i.d. DNA Model 

This section states the main result of this paper, addressing the optimal assembly of i.i.d. 
DNA sequences. We first formulate the problem and state the result. Next, we compare the 
performance of the optimal algorithm with that of other existing algorithms. Finally, we 
discuss the computational complexity of the algorithm. 

2.1 Formulation and Result 

The DNA sequence s = siS2 ■ ■ ■ sg is modeled as an i.i.d. random process of length G 
with each symbol taking values according to a probability distribution p = {pi,P2,P3,P4) 
on the alphabet {A,C,G,T}. For notational convenience we instead denote the letters by 
numerals, i.e. Si G {1, 2, 3, 4}. To avoid boundary effects, we assume that the DNA sequence 
is circular, i.e., s, = Sj ii i = j mod G; this simplifies the exposition, and all results apply 
with appropriate minor modification to the non-circular case as well. 

The objective of DNA sequencing is to reconstruct the whole sequence s based on A^ 
reads drawn randomly from the sequence (see Figure E])- A read is a substring of length L 
from the DNA sequence. The set of reads is denoted by 7^ = {ri, r2, . . . , rjv}. The starting 
location of read i is tj, so r, = s[tj,tj + L — 1]. The set of starting locations of the reads 
is denoted T = {ti,t2, ■ ■ ■ yt^}, where we assume I < ti < t2 < ■ ■ ■ < tN < G. We also 
assume that the starting location of each read is uniformly distributed on the DNA and the 
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locations are independent from one read to another. 

An assembly algorithm takes a set of A^ reads TZ = {ri, . . . , r^} and returns an estimated 
sequence s = s(7^). We require perfect reconstruction, which presumes that the algorithm 
makes an error if s 7^ 4j- We let P denote the probability model for the (random) DNA 
sequence S and the sample locations T, and S := {S ^ S} the error event. A question of 
central interest is: what are the conditions on the read length L and the number of reads A^ 
such that the reconstruction error probability is less than a given target e? Unfortunately, 
this is in general a difficult question to answer. We instead ask an easier asymptotic question: 
what is the ratio of the minimum number of reads A^min and number of reads needed to cover 
the sequence A'cov as L, G — )■ 00 with L = L/logG being a constant, and which algorithm 
achieves the optimal performance asymptotically? More specifically, we are interested in 
Cmin{L), which is defined as 

Cmin(L) = lim_ ( n T\ - (2) 

G^oo,L=L\ogG Acov(,e, G, L) 

The main result for this model is: 
Theorem 1. Fix an e < 1/2. The minimum normalized coverage depth Cmin{L) is given by 

joo ifL<2/H2ip), 
[1 ifL>2/H,{p), 
where -f^2(p) is the Renyi entropy of order 2 defined in ([T]). 

Section 12.21 proves the first part of the theorem, that reconstruction is impossible for 
L < 2/H2{p). Section [231 shows how a simple greedy algorithm can achieve optimality for 
L > 2/H2{p). 

2.2 L < jA-^' Repeat-limited regime 

The random nature of the DNA sequence gives rise to a variety of patterns. The key obser- 
vation in [22] is that there exist two patterns in the DNA sequence precluding reconstruction 



"'^The notion of perfect reconstruction can be thought of as a mathematical ideahzation of the notion of 
"finishing" a sequencing project as defined by the National Human Genome Research Institute [IT], where 
finishing a chromosome requires at least 95% of the chromosome to be represented by a contiguous sequence. 
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Figure 4: Two pairs of interleaved repeats of length L — 1 create ambiguity: from the reads 
it is impossible to know whether the sequences x and y are as shown, or swapped. 

from an arbitrary set of reads of length L. In other words, reconstruction is not possible even 
if the L-spectrum, i.e. the set of all substrings of length L appearing in the DNA sequence, 
is given. The first pattern is the three way repeat of a substring of length L — 1. The second 
pattern is two interleaved pairs of repeats of length L — 1, shown in Figure HI Arratia et al. 
[2J carried out a thorough analysis of randomly occurring repeats for the same i.i.d. DNA 
model as ours, and showed that the second pattern of two iterleaved repeats is the typical 
event for reconstruction to fail. A consequence of Theorem 7 in |2j is the following lemma 
(see also [B]). 

Lemma 2 (Arratia et al. P]). Fix L < -jj^-^- An i.i.d. random DNA sequence contains 
interleaved repeats of length L = LlogG with probability 1 — o(l). 

We give a heuristic argument for the lemma, following [2] . As shown below, the expected 
number of length-L repeats in an i.i.d. sequence has a rather sharp transition, from almost 
none to many, as L decreases below jj^r^- It turns out that the positions of the repeats are 
approximately uniformly distributed throughout the sequence, so if there are many repeats, 
then it is likely that at least two of them are interleaved. 

We proceed with computing the expected number of repeats. Denoting by Sf the length- 
L subsequence starting at position i, we have 

E[# of length L repeats] = J2 ^(Sf = ^f) ■ (4) 

^<i<j<G 

Now, the probability that two specific physically disjoint length-^ subsequences are identical 
is 



2^ ^^~eH2{p) 



i 

where -ff2(p) = — log ( Z^iPf ) is the Renyi entropy of order 2. Ignoring the GL terms in 
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where Sf and Sf overlap, we get a lower bound, 



E[# of repeats] > ( — " gA e^^^^^P) « — e-^^''^^\ (5) 

This number approaches zero if L > 2/H2{p) and it approaches infinity if L < 2/H2{p). 

Hence, if L < 2/H2{p), then the probability of having two pairs of interleaved repeats 
is very high. Moreover, as a consequence of Lemma [T^ in Appendix ??, the contribution of 
the terms in (jlj) due to physically overlapping subsequences is not large, and so the lower 
bound in i^ is essentially tight. This suggests that L = 2/H2{p) is in fact the threshold for 
existence of interleaved repeats. 
Proof of Theorem [T], Part 1: 

Proof. The probability of observing a sequence of reads ri, . . . , r^v given a DNA sequence s 

is: 

^, , N -TT^/ , N -/t # of occurrences of rj in s 

P(ri, . . . , r^|s) = n nr.ls) = H 7^ ' • 

Now suppose the DNA sequence s has two interleaved repeats of length L — 1 as in 
Figure H] and let s' be the sequence with the subsequences x and y swapped. Then, the 
number of occurrences of each read rj in s and s' is the same and hence 

P(ri,...,rjv|s)=P(ri,...,rjv|s'). 

Moreover, P(s) = P(s'). Hence 

P(s|ri,...,r,v)=P(s'|ri,...,r^). 

Thus, the optimal MAP rule will have a probability of reconstruction error of at least 1/2 
conditional on the DNA sequence having interleaved repeats of length L — 1, regardless of the 
number of reads. By Lemma [2], this latter event has probability approaching 1 as G — > oo 
if Z < 2/H2{p). Since e < 1/2, this implies that for sufficiently large G, A^min(e, G, L) = oo, 
thus proving the result. 

D 

Note that for any fixed read length L, the probability of the interleaved repeat event will 
approach 1 as the DNA length G -^ oo. This means that if we had defined the minimum 
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Figure 5: The reads must cover the sequence. 

normahzed coverage depth for a fixed read length L, then for any value of L the minimum 
normalized coverage depth would have been cxd. Thus, to get a meaningful result, one must 
scale L with G, and Lemma [2] suggests that letting L and G grow while fixing L is the correct 
scaling. 

2.3 L > -jpH:'' Coverage-limited regime 

In order to reconstruct the DNA sequence it is necessary to observe each of the nucleotides, 
i.e. the reads must cover the sequence (see Figure E]). Worse than the missing nucleotides, 
a gap in coverage also creates ambiguity in the order of the contiguous pieces. Thus, 
A^cov(e, G, L), the minimum number of reads to cover the entire DNA sequence with proba- 
bility 1 — e, is a lower bound to A^mm(e, G, L), the minimum number of reads to reconstruct 
with probability 1 — e. The paper of Lander and Waterman [12] studied the coverage prob- 
lem in the context of DNA sequencing, and from their results, one can deduce the following 
asymptotics for Nco^ie, G, L). 



Lemma 3. For any e E (0, 1^ 



.i,„ ^-<^-^-^' = 1. 



L, G^oo, L/ log G=L G/L 

A standard coupon collector-style argument proves this lemma in [T2|. An intuitive 
justification of the lemma, which will be useful in the sequel, is as follows. To a very good 
approximation, the starting locations of the reads are given according to a Poisson process 
with rate A = N/G, and thus each spacing has an exponential(A) distribution. Hence, the 
probability that there is a gap between two successive reads is approximately e~'^^. Hence, 
the expected number of gaps is approximately: 
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Asymptotically, this quantity is bounded away from zero if A^ < G/ L, and approaches zero 
otherwise. 

We show that for L > 2/H2{p), a simple greedy algorithm (perhaps surprisingly) attains 
the coverage lower bound. Essentially, the greedy algorithm merges the reads repeatedly into 
contigsa, and the merging is done greedily, according to an overlap score defined on pairs of 
strings. For a given score the algorithm is as follows. 

Greedy Algorithm: Input: TZ, the set of reads of length L. 

1. Initialize the set of contigs as the given reads. 

2. Find two contigs with largest overlap score, breaking ties arbitrarily, and merge them 
into one contig. 

3. Repeat Step 2 until only one contig remains. 

For the i.i.d. DNA model and noiseless reads, we use the overlap score W^(si, S2), defined as 
the length of the longest suffix of Si identical to a prefix of S2. 

Showing optimality of the greedy algorithm entails showing that if the reads cover the 
DNA sequence and there are no repeats of length L, then the greedy algorithm can recon- 
struct the DNA sequence. In the remainder of this subsection we heuristically explain the 
result, and we give a detailed proof in Appendix??. 

Since the greedy algorithm merges reads according to overlap score, we may think of the 
algorithm as working in stages, starting with an overlap score of L down to an overlap score 
of 0. At stage i, the merging is between contigs with overlap score i. The key is to find the 
typical stage at which the first error in merging occurs. Assuming no errors have occurred 
in stages L,L — 1, . . . ,i + 1, consider the situation in stage £, as depicted in Figure O The 
algorithm has already merged the reads into a number of contigs. The boundary between 
two neighboring contigs is where the overlap between the neighboring reads is less than or 
equal to i; if it were larger than i, the two contigs would have been merged already. Hence, 
the expected number of contigs at stage £ is the expected number of pairs of successive reads 



^Here, a contig means a contiguous fragment formed by overlapping sequenced reads. 
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Figure 6: The greedy algorithm merges reads into contigs according to the amount of overlap. 
At stage i the algorithm has already merged all reads with overlap greater than i. The red 
segment denotes a read at the boundary of two contigs; the neighboring read must be offset 
by at least L — i. 

with spacing greater than L — i. Again invoking the Poisson approximation, this is roughly 
equal to 

where A = N/G. 

Two contigs will be merged in error in stage i if the length i suffix of one contig equals 
the length i prefix of another contig from a different location. Assuming these substrings 
are physically disjoint, the probability of this event is: 

2-iH2{p)_ 

Hence, the expected number of pairs of contigs for which this confusion event happens is 
approximately: 



Are-^(^-^) 



-iH2{p) 



(6) 



This number is largest either when i = L or i = 0. This suggests that, typically, errors 
occurs in stage L or stage of the algorithm. Errors occur at stage L if there are repeats of 
length L substrings in the DNA sequence. Errors occur at stage if there are still leftover 
unmerged contigs. The no-repeat condition ensures that the probability of the former event 
is small and the coverage condition ensures that the probability of the latter event is small. 
Hence, the two necessary conditions are also sufficient for reconstruction. 
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2.4 Performance of Existing Algorithms 

The greedy algorithm was used by several of the most widely used genome assemblers for 
Sanger data, such as phrap, TIGR Assembler [23] and CAPS [8]. More recent software 
aimed at assembling short-read sequencing data uses different algorithms. We will evaluate 
the normalized coverage depth of some of these algorithms on our basic statistical model 
and compare them to the information theoretic limit. The goal is not to compare between 
different algorithms; that would have been unfair since they are mainly designed for more 
complex scenarios including noisy reads and repeats in the DNA sequence. Rather, the aim 
is to illustrate our information theoretic framework and make some contact with existing 
assembly algorithm literature. 

2.4.1 Sequential Algorithm 

By merging reads with the largest overlap first, the greedy algorithm discussed above effec- 
tively grows the contigs in parallel. An alternative greedy strategy, used by software like 
SSAKE |2H], VCAKE [I^ and SHARCGS 0, grows one contig sequentially. An unassembled 
read is chosen to start a contig, which is then repeatedly extended (say towards the right) 
by identifying reads that have the largest overlap with the contig until no more extension is 
possible. The algorithm succeeds if the final contig is the original DNA sequence. 
The following proposition gives the normalized coverage depth of this algorithm. 

Proposition 4. The minimum normalized coverage depth for the sequential algorithm is 

The result is plotted in Fig. [71 The performance is strictly worse than that of the greedy 
algorithm. We give only a heuristic argument for Proposition HI 

Motivated by the discussion in the previous section, we seek the typical overlap i at 
which the first error occurs in merging a read; unlike the greedy algorithm, where this 
overlap corresponds to a specific stage of the algorithm, for the sequential algorithm this 
error can occur anytime between the first and last merging. 

Let us compute the expected number of pairs of reads which can be merged in error 
at overlap i. To begin, a read has the potential to be merged to an incorrect successor at 

16 



CiiT— mer 

Cgeq 

1 



^2(p) 



Figure 7: The minimum normalized coverage depth obtained by the sequential algorithm is 
in the middle, given by Cgeq = ; J'y^l ° , the minimum normalized coverage depth obtained 
by the i^-mers based algorithm is at top, given by cx-mc 



LH2{p)~2- 



overlap i if it has overlap less than or equal to i with its true successor, since otherwise 
the sequential algorithm discovers the read's true successor. By the Poisson approximation, 
there are roughly Ne~^^^~^' reads with physical overlap less than or equal to i. with their 
successors. In particular, if £ < L — A^^lnA^ there will be no such reads, and so we may 
assume that d, lies between L — A^^ In A^ and L. 

Note furthermore that in order for an error to occur, the second read must not yet have 
been merged when the algorithm encounters the first read, and thus the second read must 
be positioned later in the sequence. This adds a factor one-half. Combining this reasoning 
with the preceding paragraph, we see that there are approximately 

pairs of reads which may potentially be merged incorrectly at overlap L 

For such a pair, an erroneous merging actually occurs if the length-^ suffix of the first read 

equals the length-^ prefix of the second. Assuming (as in the greedy algorithm calculation) 

that these substrings are physically disjoint, the probability of this event is 2^^-^2(p)_ 

The expected number of pairs of reads which are merged in error at overlap ^, for L — 

A^^ In A^ < £ < L, is thus approximately 
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This number is largest when i = LoTi = L — X ^ In A^, so the expression in ([7]) approaches 

N ^ LH2{p)ln2 
A^cov LH2{p)-l 



zero if and only if -j^ > ? J^'f! " and L > 2/H2{p), as in Proposition HI 



2.4.2 i^-mer based Algorithms 

Due to complexity considerations, many recent assembly algorithms operate on K-mers 
instead of directly on the reads themselves. K-meis are length K subsequences of the reads; 
from each read, one can generate L—K+1 K-mers. One of the early works which pioneer this 
approach is the sort-and-extend technique in ARACHNE [22]. By lexicographically sorting 
the set of all the K-raers generated from the collection of reads, identical i^-mers from 
physically overlapping reads will be adjacent to each other. This enables the overlap relation 
between the reads (so called overlap graph) to be computed in 0(A^log A^) time (time to sort 
the set of i^-mers) as opposed to the 0(A^^) time needed if pairwise comparisons between 
the reads were done. Another related approach is the De Brujin graph approach [HI [IS]. In 
this approach, the i^-mers are represented as vertices of a De Brujin graph and there is an 
edge between two vertices if they represent adjacent i^-mers in some read (here adjacency 
means their positions are offset by one). The DNA sequence reconstruction problem is then 
formulated as computing an Eulerian cycle traversing all the edges of the De Brujin graph. 

The performance of these algorithms on the basic statistical model can be analyzed by 
observing that two conditions must be satisfied for them to work. 

First, K should be chosen such that with high probability, K-mers from physically dis- 
joint parts of the DNA sequence should be distinct, i.e. there are no repeats of length K 
subsequences in the DNA sequence. In the sort-and-extend technique, this will ensure that 
two identical adjacent i^-mers in the sorted list belong to two physically overlapping reads 
rather than two physically disjoint reads. In the De Brujin graph approach, this will ensure 
that the Eulerian path will be connecting K-mers that are physically overlapping. This 
minimum K can be calculated as we did to justify Lemma [2J 

K 2 

> 



logG H^ip) ■ 

Second, all successive reads should have physical overlap of at least K base pairs. This 
is needed so that the reads can be assembled via the K-mers. According to the Poisson 



approximation, the expected number of successive reads with spacing greater than L — K 
base pairs is roughly Ne^^^^~^' . To ensure that with high probabihty all successive reads 
have overlap at least K base pairs, this expected number should be small, i.e. 

^^ GlniV GlnG 

Substituting Eq. ([H]) into this and using the definition L = L/ logG, we obtain 

> 



N,,, LH2{p) - 2 

The minimum normalized coverage depth of this algorithm is plotted in Figure [71 Note 
that the performance the K-mer based algorithms is strictly less than the performance 
achieved by the greedy algorithm. The reason is that for L > 2/H2{p), while the greedy 
algorithm only requires the reads to cover the DNA sequence, the K-mer based algorithms 
need more, that successive reads have (normalized) overlap at least 2/H2{p). 

2.5 Complexity of the Greedy Algorithm 

A naive implementation of the greedy algorithm would require an all-to-all pairwise compar- 
ison between all the reads. This would require a complexity of 0{N^) comparisons. For N 
in the order of tens of millions, this is not acceptable. However, drawing inspiration from the 
sort-and-extend technique discussed in the previous section, a more clever implementation 
would yield a complexity of 0{LN log N). Since L <^ N, this is a much more efficient im- 
plementation. Recall that in stage i of the greedy algorithm, successive reads with overlap 
i are considered. Instead of doing many pairwise comparisons to obtain such reads, one can 
simply extract all the i-mers from the reads and perform a sort-and-extend to find all the 
reads with overlap i. Since we have to apply sort-and-extend in each stage of the algorithm, 
the total complexity is 0{LN log N). 

An idea similar to this and resulting in the same complexity was described by Turner [2S] 
(in the context of the shortest common superstring problem), with the sorting effectively 
replaced with a suffix tree data structure. Ukkonen [27] used a more sophisticated data 
structure, which essentially computes overlaps between strings in parallel, to reduce the 
complexity to 0{NL). 
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3 Markov DNA Model 

In this section we extend the results for the basic i.i.d. DNA sequence model to a Markov 
sequence model. 

3.1 Formulation and Result 

The problem formulation is identical to the one in Section 12.11 except that we assume the 
DNA sequence is correlated and model it by a Markov source with transition matrix Q = 
fei]«je{i,2,3,4}, where qij = ¥{Sk = i\Sk-i = j). 

Remark 5. We assume that the DNA is a Markov process of order 1, but the result can be 
generalized to Markov processes of order m as long as m is constant and does not grow with 
G. 

In the basic i.i.d. model, we observed that the minimum normalized coverage depth de- 
pends on the DNA statistics through the Renyi entropy of order 2. We prove that a similar 
dependency holds for Markov models. In |2Xj, it is shown that the Renyi entropy rate of 
order 2 for a stationary ergodic Markov source with transition matrix Q is given by 

\Pmax(Q)/ 

where Pmax(<5) = niax{|p| : p eigenvalue of Q}, and Q = [q'fJijGli, 2,3,4}- In terms of this 
quantity, we state the following theorem. 

Theorem 6. The minimum normalized coverage depth of a stationary ergodic Markov DNA 
sequence is given by 

foo ifl<2/H,{Q), 

CmUL) = \ (10) 

[1 ifL>2/H2{Q). 

3.2 Sketch of Proof 

Similar to the i.i.d. case, it suffices to show the following statements: 
1. If L < jfTQ), A^mm(e, G,L) = 00 for sufficiently large G. 
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2. If L > j^^, then Cmin = 1- 

The following lemma is the analogue of Lemma [2] for the Markov case, and is used 
similarly to prove statement 1. 

Lemma 7. If L < 2/H2{Q), then a Markov DNA sequence contains interleaved repeats with 
probability 1 — o(l). 

To justify Lemma [7] we use a similar heuristic argument as for the i.i.d. model, but with 
a new value for the probability that two physically disjoint sequences Sf and Sj" are equal: 

The lemma follows from the fact that there are roughly G^ such pairs in the DNA sequence. 
A formal proof of the lemma is provided in Appendix IB. II 

Statement 2 is again a consequence of the optimality of the greedy algorithm, as shown 
in the following lemma. 

Lemma 8. The greedy algorithm with exactly the same overlap score as used for the i.i.d. 
model can achieve minimum normalized coverage depth Cmin = 1 if L > TfTnT- 

Lemma [8] is proved in Appendix IB. 21 The key technical contribution of this result is to 
show that the effect of physically overlapping reads does not affect the asymptotic perfor- 
mance of the algorithm, just as in the i.i.d. case. 

4 Noisy Reads 

In our basic model, we assumed that the read process is noiseless. In this section, we assess 
the effect of noise on the greedy algorithm. 

4.1 Formulation and Result 

The problem formulation here differs from Section fI7I\ in two aspects. First, we assume that 
the read process is noisy and consider a simple probabilistic model for the noise. A nucleotide 
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s is read to be r with probability Q{r\s). Each nucleotide is perturbed independently, i.e. if 
r is a read from the physical underlying subsequence s of the DNA sequence, then 

L 

Fir\s) = II Qin\si). 
1=1 

Moreover, it is assumed that the noise affecting different reads is independent. 

Second, we require a weaker notion of reconstruction. Instead of perfect reconstruction, 

we aim for perfect layout. By perfect layout, we mean that all the reads are mapped correctly 

to their true locations. Note that perfect layout does not imply perfect reconstruction as the 

consensus sequence may not be identical to the DNA sequence. On the other hand, since 

coverage implies that most positions on the DNA are covered by many reads {0{logG), to be 

more precise), the consensus sequence will be correct in most positions if we achieve perfect 

layout. 

Remark 9. In the jargon of information theory, we are modeling the noise in the read process 
as a discrete memoryless channel with transition probability Q{-\-). Noise processes in actual 
sequencing technologies can be more complex than this model. For example, the amount of 
noise can increase as the read process proceeds, or there may be insertions and deletions in 
addition to substitutions. Nevertheless, understanding the effect of noise on the assembly 
problem in this model provides considerable insight to the problem. 

We now evaluate the performance of the greedy algorithm for the noisy read problem. 
Finding the optimal algorithm for this case is an open problem. 

To tailor the greedy algorithm for the noisy reads, the only requirement is to define the 
overlap score between two distinct reads. Given two reads r^ and Tj, we would like to know 
whether they are physically overlapping with length i. Let X and Y of length i be the suffix 
of Ti and prefix of r^, respectively. We have the following hypotheses for X and Y: 

• Hq: X. and Y are noisy reads from the same physical source subsequence; 

• Hi. X. and Y are noisy reads from two disjoint source subsequences. 

The decision rule that is optimal in trading off the two types of error is the maximum a 
posteriori (MAP) rule, obtained by a standard large deviations calculation (see for example 
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Chapter 11.7 and 11.9 of [4J.) In log likelihood form, the MAP rule for this hypothesis testing 
problem is: 

Decide Ho if log ^g]^ = EU l^S ^^feg) > ^^' (H) 

where Px,y{x, y), Px{x) and Pyiy) are the marginals of the joint distribution Ps{s)Q{x\s)Q{ii\s), 
and ^ is a parameter reflecting the prior distribution of Hq and Hi. 

We can now define the overlap score, whereby two reads Rj and Rj have overlap at least 
C if the MAP rule on the length d, suffix of Rj and the length d, prefix of read Rj decides Hq. 
The performance of the greedy algorithm using this score is given in the following theorem. 

Theorem 10. The modified greedy algorithm can achieve normalized coverage depth c{L) = 1 
if L > 2/r , where 

r=maxmm{2D{Pj\Px,Y),D{P^\\Px-PY)),, 



and the distribution P^ is given by 



[PxA^,y)nPxix)Pyiy)] 



1-M 



with 11 the solution to the equation 

D{P^\\Px ■ Py) - D{P^\\Px,y) = e. 

The statement of Theorem [TUl uses the KL Divergence D{P\\Q) of the distribution P 
relative to Q, defined as 

DiP\\Q) = ^P(a)^o,^^. (12) 

The details of the proof of the theorem are in Appendix O To illustrate the main ideas, we 
sketch the proof for the special case of uniform source and symmetric noise. 

4.2 Sketch of Proof for Uniform Source and Symmetric Noise 

In this section, we provide an argument to justify Theorem [TUl in the case of uniform source 
and symmetric noise. Concretely, p = (1/4, 1/4, 1/4, 1/4) and the noise is symmetric with 
transition probabilities: 

{1 — e if z = 7 
(13) 
e/3 ifz^j. 
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H2{p) I'{e) ^ 

Figure 8: The performance of the modified greedy algorithm with noisy reads. 

The parameter e is often called the error rate of the read process. It ranges from 1% to 10% 
depending on the sequencing technology. 

Corollary 11. The greedy algorithm with the modified definition of overlap score between 
reads can achieve normalized coverage depth c{L) = 1 if L > 2//*(e), where 



r(e) =D(a 



*l|3^ 



and a* satisfies 



D(a*|||)=2D(a*||2e-|e2). 



Here, D{a\\P) is the divergence between a Bern(a) and a Bern(/3) random variable. 

Proof. The proof follows by applying Theorem [TUl For uniform source and symmetric noise, 
the optimum /* is attained when 2D(P^||Px,y) = D{P^\\Px ■ Py)- The result is written in 
terms of a* which is a function of the optimal value 6*. D 

The performance of this algorithm is shown in Figure [HI The only difference between the 
two is a larger threshold, 2//*(e) at which the minimum normalized coverage depth becomes 
one. A plot of this threshold as a function of e is shown in Figure M It can be seen that 
when e = 0, 2//*(e) = 2/H2{p) = 1, and increases continuously as e increases. 

We justify the corollary by the following argument. In the noiseless case, two reads 
overlap by at least i if the length i prefix of one read is identical to the length i suffix of 
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Figure 9: Plot of jq^ as a function of e for the uniform source and symmetric noise model. 



the other read. The overlap score is the largest such i. When there is noise, this criterion is 
not appropriate. Instead, a natural modification of this definition is that two reads overlap 
by at least £ if the Hamming distance between the prefix and the suffix strings is less than 
a fraction a of the length i. The overlap score between the two reads is the largest such 
i. The parameter a controls how stringent the overlap criterion is. By optimizing over the 
value of a, we can obtain the following result. 

We picture the greedy algorithm as working in stages, starting with an overlap score of L 
down to an overlap score of 0. Since the spacing between reads is independent of the DNA 
sequence and noise process, the number of reads at stage £ given no errors have occurred in 
previous stages is again roughly 

To pass this stage without making an error, the greedy algorithm should correctly merge 
those reads having spacing of length i to their successors. Similar to the noiseless case, the 
greedy algorithm makes an error if the overlap score between two non-consecutive reads is i 
at stage i, in other words 

1. The Hamming distance between the length i suffix of the present read and the length 
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C. prefix of some read wliich is not the successor is less than ad by random chance. 

A standard large deviations calculation shows that the probability of this event is approxi- 
mately 

2-e-D(a\\l)^ 

which is the probability that two independent strings of length C. have Hamming distance 
less than aL Hence, the expected number of pairs of contigs for which this confusion event 
happens is approximately 

Unlike the noiseless case, however, there is another important event affecting the perfor- 
mance of the algorithm. The missed detection event is defined as 

2. The Hamming distance between the length (, suffix of the present read and the length 
£ prefix of the successor read is larger than al due to an excessive amount of noise. 

Again, a standard large deviations calculation shows that the probability of this event for a 
given read is approximately 

where rj = 2e — |e^ is the probability that the ith symbol in the length i suffix of the present 
read does not match the ith symbol in the length i prefix of the successor read (here we are 
assuming that a > rj). Thus the expected number of contigs missing their successor contig 
at stage £ is approximately 

Ne-^(L-E)2-eDia\\rj)_ (15) 



Both Equations ( 1141) and ( 1151) are largest either when i = L or i = 0. Similarly to the 
noiseless case, errors do not occur at stage if the DNA sequence is covered by the reads. 
The coverage condition guarantees no gap exists in the assembled sequence. From (IT^ 
and (IT^ we see that no errors occur at stage L if 

L 2 

> 



logG mm{2D{a\\r]),D{a\\l)) 

Selecting a to minimize the right hand side results in the two quantities within the minimum 
being equal, which gives the result. 
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5 Discussions and Future Work 

This paper seeks to understand the basic data requirements for shotgun sequencing, and we 
obtain resuhs for simple models. The models for the DNA sequence and read process in this 
paper serve as a starting point from which to pursue extensions to more realistic models. 
We discuss a few of the many possible extensions. 

Long repeats. Long repeats occur in many genomes, from bacteria to human. The 
repetitive nature of real genomes is understood to be a major bottleneck for sequence as- 
sembly. Thus a caveat of this paper is that the DNA sequence models we have considered, 
both i.i.d. and Markov, exhibit only short-range correlations, and therefore fail to capture 
the long-range correlation present in complex genomes. Motivated by this issue, a follow-up 
work [3] extends the approach of this paper to arbitrary repeat statistics, in particular the 
statistics of actual genomes. The read model considered in [5] is the same uniform noiseless 
model we consider. 

We briefly summarize the results and approach of [3]. First, Ukkonen's condition that 
there be no interleaved or triple repeats of length at least L — 1 is generalized to give a 
lower bound on the read length and the coverage depth required for reconstruction in terms 
of repeat statistics of the genome. Next, they design a de Brujin graph based assembly 
algorithm that can achieve very close to the lower bound for repeat statistics of a wide 
range of sequenced genomes. The approach results in a pipeline, which takes as input 
a genome sequence and desired success probability 1 — e, computes a few simple repeat 
statistics, and from these statistics computes a feasibility plot that indicates for which L and 
N reconstruction is possible. 

Double-strandedness. The DNA sequence is double stranded and consists of a sequence 
s and its reverse complement s. Reads are obtained from either of the two strands, and a 
natural concern is whether this affects the results. It turns out that for the i.i.d. sequence 
model considered in this paper (as well as the Markov model), the asymptotic minimum 
normalized coverage depth remains the same. The optimal greedy algorithm is modified 
slightly by including the reverse complements of the reads as well as the originals, and 
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stopping when there are two reconstructed sequences s and s. The heuristic argument 
follows by observing that the probability of error at stage £ given in ([6]) is changed only 
be a factor two, which does not change the asymptotic result. The rigorous proof involves 
showing that the contribution from overlapping reads is negligible, where the notion of reads 
overlapping accounts for both the sequence and its reverse complement. 

Read process. There are a number of important properties of the read process which 
can be incorporated into more accurate models. Beyond the substitution noise considered in 
this paper, some sequencing technologies (such as PacBio) produce insertions and deletions. 
Often bases come with quality scores, and these scores can be used to mitigate the effect of 
noise. Other interesting aspects include correlation in the noise from one base to another 
(e.g. typically producing several errors in a row), non- uniformity of the error rate within 
a read, and correlation of the noise process with the read content. Aside from noise, a 
serious practical difficulty arises due to the positions of reads produced by some sequencing 
platforms being biased by the sequence, e.g. by the GC content. Noise and sampling bias 
in the reads make assembly more difficult, but another important direction is to incorporate 
mate-pairs into the read model. Mate-pairs (or paired-end reads) consisting of two reads 
with an approximately known separation, help to resolve long repeats using short reads. 

Partial reconstruction. In practice the necessary conditions for perfect reconstruct are 
not always satisfied, but it is still desirable to produce the best possible assembly. While 
the notion of perfect reconstruction is relatively simple, defining what "best" means is more 
delicate for partial reconstructions; one must allow for multiple contigs in the output as well 
as errors (misjoins). Thus an optimal algorithm is one which trades off optimally between 
the number of contigs and number of errors. 
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A Proof of Theorem [T], Part 2 

We first state and prove the following lemma. This result can be found in [2], but for ease 
of generalization to the Markov case later, we will include the proof. 

Lemma 12. For any distinct substrings X and Y of length i of the i.i.d. DNA sequence: 

1. If the strings have no physical overlap, the probability that they are identical is e~^^^^^\ 

2. If the strings have physical overlap, the probability that they are identical is bounded 
above by e'^^^^^^/^ . 

Proof. We first note that for any k distinct bases in the DNA sequence the probability that 
they are identical is given by 

4 
A v^ k 

^k = 2^Pi- 

1- Consider X = Si^i . . . Si+i and Y = 5*^+1 . . . 5*^+^ have no physical overlap. In this 
case, the events {Sj+m = Sj+m} for m G {!,...,£} are independents and equiprobable. 
Therefore, the probability that X = Y is given by 



TTn 



2-eH2{p)_ 



2- For the case of overlapping strings X and Y, we assume that a substring of length 
k < i from the DNA sequence is shared between the two strings. Without loss of generality, 
we also assume that X and Y are, respectively, the prefix and suffix of S^^"*'. Let q and r be 
the quotient and remainder of i divided hj i — k, i.e., i = q{i — k) + r where < r < i — k. 
It can be shown that X = Y if and only if Sj ~ is a string of the form UVUV . . . UVU 
where U and V have length r and i — k — r. Since the number of U and V are, respectively, 
g + 2 and q + 1, the probability of observing a structure of the form UVUV. . . UVU is 
given by 
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where (a) comes from the fact that (vrg)9 < [112)'^ for all q > 2. Since k < i, we have 
(712) 2 < (vr2)2. Therefore, the probability that X = Y for two overlapping strings is 
bounded above by 

This completes the proof. D 

Proof of Theorem [T], Part 2 

The greedy algorithm finds a contig corresponding to a substring of the DNA sequence if 
each read Rj is correctly merged to its successor read R|_^with the correct amount of physical 
overlap between them which is V^ = L — (T/ — Tj) mod gO If, in addition, the whole sequence 
is covered by the reads then the output of the algorithm is exactly the DNA sequence S. 

Let £1 be the event that some read is merged incorrectly; this includes merging to the 
read's valid successor but at the wrong relative position as well as merging to an impostoc. 
Let £2 be the event that the DNA sequence is not covered by the reads. The union of these 
events, £^1 U £^2, contains the error event £. We first focus on event £1. 

Since the greedy algorithm merges reads according to overlap score, we may think of the 
algorithm as working in stages starting with an overlap score of L down to an overlap score 
of 0. Thus £1 naturally decomposes as £1 = UpAp,, where Ap is the event that the first error 
in merging occurs at stage L 

Now, we claim that 

A C S, U C, , (16) 

where: 

Bi 4 {R^- ^ R^^, Uj <e,Vi< £, Wij = e for some i ^ j.} (17) 

Cp 4 {Kj = R^Uj = Vi < £,Wij = Hot some i^ 3.} (18) 

If the event Ae occurs, then either there are two reads Rj and R/s such that Rj is merged 
to its successor R^ but at an overlap larger than their physical overlap, or there are two reads 
Rj and Rj such that Rj is merged to Rj, an impostor. The first case implies the event Cp. In 



■^Note that the physical overlap can take negative values. 
■^A read Rj is an impostor to Ri if W^(Ri, R^) > Vi. 
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the second case, in addition to Wij = i, it must be true that the physical overlaps Vi,Uj < £, 
since otherwise at least one of these two reads would have been merged at an earlier stage. 
(By definition of Ae, there were no errors before stage i). Hence, in this second case, the 
event Bi occurs. 

Now we will bound P(i3^) and P(Q). 

First, let us consider the event Bf. This is the event that two reads which are not 
neighbors with each other got merged by mistake. Intuitively, event Bg says that the pairs 
of reads that can potentially cause such confusion at stage i are limited to those with short 
physical overlap with their own neighboring reads, since the ones with large physical overlaps 
have already been successfully merged to their correct neighbor by the algorithm in the early 
stages. In Figure El these are the reads at the ends of the contigs that are formed by stage 



For any two distinct reads Rj and Hj, we define the following event 

B'/ ^ {R, ^ R,^ U, <l,V,< £, W,, = £} 



From the definition of Bi in (TT7|) . we have Be C UijB}-' . Applying the union bound and 
considering the fact that Bg"s are equiprobable yields 

F{Be) < Ar2p(i3f ). 

Let V be the event that the two reads Ri and R2 have no physical overlap. Using the 
law of total probability we obtain 

P(i3f ) = P(i3f |P)P(P) + P(i3f |P^)P(P'=). 

Since V^ happens only if T2 G [Ti - L + 1, Ti + L - 1], P(P^) < ^. Hence, 

or 

P(Sf ) < P(i3f |P) + F{Bl^\V')—. (19) 

Gr 

We proceed with bounding F{B}'^\V) as follows, 

¥{Bf\V) = F{U2 <i,Vi< i, Wu = i\V) 

= HU2 <i,Vi< i\V)F{Wi2 = i\V) 



(b) 
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where (a) comes from the fact that given X>, the events {U2 < i,Vi < £} and {Wu = i} are 
independent, and (6) follows from Lemma fT2] part 1. 

Note that the event {R2 7^ Hf, f/2 < ^j Vi < i} implies that no reads start in the intervals 
[Ti,Ti + L — i — 1] and [T2 — L + £ + 1, T2]. Given V, if the two intervals overlap then there 
exists a read with starting position Tj with Tj G [Ti,Ti + L — i—l] or Tj G [T2 — L + £ + 1 , T2] . 
To see this, suppose Tj is not in one of the intervals then R2 has to be the successor of Ri 
contradicting R2 7^ R^ If the two intervals are disjoint, then the probability that there is 
no read starting in them is given by 



G J 
Using the inequality 1 — a < e~", we obtain 

P(i3f ID) < e-2^(^-^)(i-2/^)2-^^^(P) (20) 

To bound F{Bj'^\V^), we note that it has a non-zero value only if the length of physical 
overlap between Ri and R2 is less than i. Hence, we only consider physical overlaps of 
length less that i and denote this event by Vi. We proceed as follows. 



< P(Vi < i, Wu = ^\Di) 

< P(V^i < ^|Pi)2-^^^(P)/2 

where (a) comes from the fact that given Pi, the events {Vi < £} and {Wu = i} are 
independent, and (6) follows from Lemma [T2] part 2. Since {Vi < i} corresponds to the 
event that there is no read starting in the interval [Ti, Ti + L — £ — 1], we obtain 

nvi < em = 1^1- -^j . 

Using the inequality 1 — a < e~", we obtain 

p(^12|p2) < g-A(L-£){l-2/iV)2-^H2(p)/2_ 
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Putting all terms together, we have 

nSe) < qj + 2XLqe. (21) 

where 

q, = AG'e'"(^-^)(^-2/iV)2-^H,(p)/2_ ^22) 

The first term reflects the contribution from the reads with no physical overlap and the 
second term from the reads with physical overlap. Even though there are lots more of the 
former than the latter, the probability of confusion when the reads are physically overlapping 
can be much larger. Hence both terms have to be considered. 
Let us define 

Ci^{V,<i,W{R,,Rt)=i}. 

From the definition of Ce in ( IT8l) . we have Ce C UjQ. Applying the union bound and 
considering the fact that Q's are equiprobable yields 

Ce < NF{CJ). 
Hence, 

P(Q) < NF{WiRuRt) = ^\Vi < iMVi < i) 
Applying Lemma [T^ part 2, we obtain 

P(C^) < iVe-^^^(p)/2 ( 1 



L-iV-' 



G J 
Using the inequality 1 — a < e~'^, we obtain 

P(Q) < AGe"^(^-^)(i"i/^)2-^^2(p)/2 

< qi (23) 



Using the bounds, (l2Ti) and ( l23l) . we get 

F{S,) = F{UeAe) < J2 n^^) = E ^i^i) + ^i^i) <114 + (2AL + l)g,, 
£=0 1=0 e=o 
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where qi is defined in fl2^ . Since qi is nionotonic in £, we can further bound P(£^i) by: 

P(^i) <{L + l) max [ql + (2AL + l)go, ql + (2AL + l)ql} . (24) 

Since L > 7^-^, Q'l vanishes exponentially in L and the second term on the right hand 
side of (Ell) has no contribution asymptotically. Now, choose 

Ar = ^ln(GL3). 

A direct computation shows that for this choice of A^, q^ + (2AL + l)go = 0(-^). Hence, the 
bound ^IM implies that P(^i) — )■ 0. Moreover the probability of no coverage "^{£2} also goes 
to zero with this choice of A^. Hence, the probability of error in reconstruction ¥{£) also goes 
to zero. This implies that the minimum number of reads required to meet the reconstruction 
error probability of at most e satisfies: 

A^min(e,G',L)<^ln(GL3) 

for sufficiently large G and L with L/\ogG = L. Hence, this implies that 

limsup = = < 1. 

L,G-5>oo,L/logG=L G/L 

Combining this with Lemma [31 we get: 

Ar^i„(e,G,L) 
limsup = — - — - — < 1. 

L,G-s-oo,L/logG=L J^cov[^,G,L) 

But since A'mm(e, G, L) > A'cov(e, G, L), it follows that: 

,. Ar^in(e,G,L) 

lim _ = — — — - = 1, 

L,G^oo,L/logG=L Ncov{e,G,L) 

completing the proof. 



B Proof of Theorem E 



The stationary distribution of the source is denoted by p = (j9i,P2)P3,P4)*- Since Q has 
positive entries, the Perron- Frobenius theorem implies that its largest eigenvalue Pmax(Q) 
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is real and positive and the corresponding eigenvector tt has positive components. The 
following inequality is useful: 



max {-XwQ'-^-n: 
{1,2,3,4} IvrJ 



2 2 



ie{l, 2,3,4} LvTj J " ' 

= max i — I (pmax(<5)) I|7t||i 
ie{i,2,3,4} IttJ ^ ^ 

= 7(pmax(g))' (25) 

where7 = max,e{i,2,3,4}{^-3Qy}. 

B.l Proof of Lemma \7\ 

In [2] , Arratia et al. showed that interleaved pairs of repeats are the dominant term causing 
non-recoverability. They also used poisson approximation to derive bounds on the event 
that S is recoverable from its L-spectrum. We take a similar approach to obtain an upper 
bound under the Markov model. First, we state the following theorem regarding Poisson 
approximation of the sum of indicator random variables, c.f. Arratia et al. |1]. 

Theorem 13 (Chen-Stein Poisson approximation). Let W = J2a(^i Xa where Xa's are indi- 
cator random variables for some index set I . For each a, Ba C / denotes the set of indices 
where Xa is independent from the a-algebra generated by all xp with /3 G / — Ba- Let 

h = Y.Y. nxa\nxpi (26) 

&2 = E E nXcXp]. (27) 



Then 



dTv{W, W) < ^-^(&l + &2), (2^ 



where 6 = E[iy] and drviW, W) is the total variation distanc^ between W and Poisson 
random variable W with the same mean. 



'''The total variation distance between two distributions W and W is defined by dT\i{W,W') 
supy^gjr |Pvi/(^) — lPiv(^)l: where F is the cr-algebra defined for W and W' 
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Proof of Lemma [7] Let U denote the event that there is no two pairs of interleaved repeats 
in the DNA sequence. Given the presence of k repeats in S, the probabihty of U can be 
found by using the Catalan numbers [2]. This probability is 2^ /{k + 1)!. \i Z denotes the 
random variable indicating the number of repeats in the DNA sequence, we obtain, 

To approximate P(ZY), we partition the sequence as 

S = S'iXiS'l+2X2S'2(L+i)+iX3 . . . S{^K~l){L+l)+l^K 

where Xj = S[(i - 1)(L + 1) + 2,i{L + 1)] and K = -^. Each Xj has length L and will 
be denoted by Xj = Xn . . . Xn. We write Xj ~ Xj with i ^ j to mean Xn ^ Xji and 
Xik = Xjk for 2 < k < L. In other words, Xj ~ Xj means that there is a repeat of length 
at least L — 1 starting from locations (z — 1)(L + 1) + 3 and (j — 1)(L + 1) + 3 in the DNA 
sequence and the repeat cannot be extended from left. The requirement Xn ^ Xji is due 
to the fact that allowing left extension ruins accuracy of Poisson approximation as repeats 
appear in clumps. 

Let I = {{i, j)\l < i < j < K}. Let Xa with a & I denote the indicator random variable 
for a repeat at a = (i, j), i.e., Xa = l(Xi ~ Xj). Let W = J2aei^a- Clearly, 

ni^)<Ej^nw = k). 

Letting Y = SiSl+2 ■ ■ ■ S(^k~i){l+i)+i, we obtain 

P(W) <Y.Y1 7T^^(^ = ^|Y)P(Y). 
For any Y, let e be the total variation distance between W and its corresponding Poisson 
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distribution W with mean 6y = E,[W\Y]. Then, we obtain 



oo 



Y 

We assume ^y > 8 for all Y and let 6 = minv ^y- For this region, the exponential factor 
within the summation is monotonically decreasing and 



P(W) < e + e-'+'^^'r (29) 

To calculate the bound, we need to obtain an upper bound for e and a lower bound for 
6. We start with the lower bound on 6. From Markov property and for a given a = {i,j), 

. \¥{Xii ^ Xji\Y)\ ,^-^ 2 2 2 

K "«l ) iii2-..ii 

L 



= C (Pmax(Q) 

where C = niin < ' ^/An k Therefore 



0Y = EE[X«I>^] > (f y (pmax(Q))^ = ^. (30) 

To bound e, we make use of the Chen-Stein method. Let Ba = {{i',j') G I\i' = i or j' = 
j}. Note that Ba has cardinality 2K — 3. Since given Y, Xa is independent of the sigma- 
algebra generated by all x^s, f3 E I — Ba, we can use Theorem [T3] to obtain 

dMW,W'\Y)<^-^^, (31) 

Uy 
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where bi and 62 are defined in (I2B]) and (1271) . respectively. Since E[XoX^] = E[Xq,]E[X^] for 
all a 7^ /3 G i?a, we can conclude that 62 < &i- Therefore, 



Since 9 < 9y, 



dT:y{W,W'\Y)<-^. 

Uy 



2h 



In order to compute 61, we need an upper bound on E[xo|Y]. By using (1251) . we obtain 

E[X«|Y]= E P(Xa^X,i|Y)g^g^ ...g^^,^_^ 



Hence, 



^ \^ 22 2 

ili2...iL 
< 1 iPmaxiQ)] 



^1 = E E nXa\Y]E[xp\Y], 



<(2ir-3) y(pmax(g))" 



< 



472^2 



Using the bound for 61, we have the following bound for the total variation distance. 

8^25) 

Form the above inequality, we can choose e = ^J^. Substituting in (12^ yields 

P(W) < ^ + e-^+2v^. (32) 

From the definition of 9 in ( l30l) , we have 

2K 
Therefore, if 2 > Llogf — ho)] ^^en 9 and j^ go, respectively, to infinity and zero expo- 
nentially fast. Since the right hand side of (I32p approaches zero, we can conclude that with 
probability 1 — o(l) there exists a two pairs of interleaved repeats in the sequence. This 
completes the proof. 
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B.2 Proof of Lemma [8] 

The proof follows closely from that of the i.i.d. model. In fact, we only need to replace 
Lemma [T2] with the following lemma. 

Lemma 14. For any distinct substrings X and Y of length i of the Markov DNA sequence: 

1. If the strings have no physical overlap, the probability that they are identical is bounded 
above &^72^^°§(^-^''('^)). 

2. If the strings have physical overlap, the probability that they are identical is bounded 
above by y^2^'°s('''-^''('^))/2^ 

Proof. 1- From Markov property, we can show that 

P(X = Y)= ^ FiX, = Y, = t,)ql^ql^...ql^_^ 

iii2...i£ 
^ "ST^ 2 2 2 

ili2...ii 

< 7 (Pmax(Q)^ 



where the last line follows from 

2- Without loss of generality, we assume that X = S[l, £] and Y = S[i — k + l,2i — k] 
for some k G {1, ...,£— 1}. Let q and r be the quotient and remainder of dividing 21 — k 
hj i — k. From decomposition of S[l, 2£ — k] as U1U2 . . . UgV where |Ui| = i — k for all 
i E {I, . . . ,q} and | V| = r, one can deduce that X = Y if and only if Uj = S1S2 ■ ■ ■ Si^k for 
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alH G {1, . . . , q} and V = S1S2 . . . Sr- Hence, we have 

P(X = Y) = P(S[1, 2i-k] = VV... UV) 

/ y Ph \Qi2i1Qi3i2 ■ • • Qiiie~kJ \Qi2i1Qi3i2 ■ ■ ■ Qirir-l 



1\12. 



(a) 



— \ 2^Piii 2-^ \Hi2i\ii3i2 ■ ■ ■ ^hii-t) yii2il^i-ii2 ' ' ' ^irir-i 



iii2---ie-k 



ib) 



— \ 2-^ \li2i\ii-^i2 ■ ■ ■ %iii-k) yli2ii^i3i2 ■ ' ■ "' 



2 

1312 ■ ■ ■ ^irir-l 



ll«2- 



(c) 



— J 2^ '?i2n'?«3«2 ■ ■ ■ ^i2l-ki2t-k-l 



I\l2---l2l-k 



W I '/ - \2e~k 

< \jl (Pmax(<5)j 

= V7(Pmax(<5)j 

< y7(Pmax(Q))' , 



where (a) follows from the Cauchy-Schwarz inequality and (6) follows from the fact that 
X^i-Pf < 1- In (c), some extra terms are added to the inequality, {d) comes from (123]) and 
finally (e) comes from the fact that k < i and Pmax(Q) ^1- D 

C Proof of Theorem lU] 

As explained in Section I4.H the criterion for overlap scoring is based the MAP rule for 
deciding between two hypotheses: Hq and Hi. The null hypothesis Hq indicates that two 
reads are from same physical source subsequence. Formally, we say two reads R, and Rj 
have the overlap score Wij = if if if is the longest suffix of Rj and prefix of Rj passing the 
criterion fITT]) . 

Let /(£) = (! + £)' , where jA"! is the cardinality of the channel's output symbols. The 
following theorem is a standard result in the hypothesis testing problem, c.f. Chapter 11.7 
of in. 

Theorem 15. Let X and Y be two random sequences of length L For the given hypotheses 
Hq and Hi and their corresponding MAP rule 07]) . 
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and 

AHi\Ho) < f(i)2-^D{P,\\P^,r)^ 

where 

p( . [PxA^,y)nPx{x)PY{y)Y->^ 

and fi is the solution of 

D{P^\\Px ■ Py) - D{P^\\PxA = e. 

Parallel to the proof of the noiseless case, we first prove the following lemma concerning 
erroneous merging due to impostor reads. 

Lemma 16 (False alarm). For any distinct i-mers X and Y from the set of reads: 

1. If the two (i-mers have no physical overlap, the probability that Hq is accepted is 

f^i)2-'DiP,\\p^.py)_ (33) 

2. If the two i-mers have physical overlap, the probability that Hq is accepted is 

7/(£)2"^^(^''ll^^-^^)/2, (34) 

where 7 is a constant. 

Proof. The proof of the first statement is an immediate consequence of Theorem [T3 

We now turn to the second statement. We only consider i = 2k and the other case 
can be deduced easily by following similar steps. Let Xj = log pu^)p( .) ■ Since x/s are not 
independent, we cannot directly use Theorem [T^ to compute P (Z]j=i Xj ^ ^^)- However, we 
claim that Xi's can be partitioned into two disjoint sets Ji and J2 of the same size, where 
the Xj's within each set are independent. Assuming the claim, 
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where (a) follows from the union bound. Since |Ji| = |, one can use Theorem [TS] to show 



m 

It remains to prove the claim. To this end, let k be the amount of physical overlap 
between X and Y. Without loss of generality, we assume that S1S2 ■ ■ ■ S2i+k is the shared 
DNA sequence. Let q and r be the quotient and remainder of i divided by 2{i — k), i.e. 
i = 2q{i — k) + r where < r < 2{£ — k). Since i is even, r is even. Let Ji be the set of 
indices j where either (j mod 2(£ — A;)) G {0, 1, ...,£ — A; — 1} for j G {1, ... , 2q{£ — k)} or 
j G {2q{£ — A;) + 1, ... , 2q{i — k) + |}. We claim that the random variables Xj's with j G Ji 
are independent. We observe that Xj depends only on Sj and Sj+(£_fc). Consider two indices 
ji < J2 e Ji. The pairs (s^-,, Sj^+(e~k)) and (sj^, Sj^+(^i^k)) are disjoint iff ji + {£ - k) ^ j2- By 
the construction of Ji, one can show that ji + {£ — k) ^ J2 for any ji < J2 G Ji. Hence, Xj's 
with j G Ji are independent. A similar argument shows Xj's with j G J2 = {1, . . . , 2i—k} — Ji 
are independent. This completes the proof. D 

Due to noise, two physically overlapping reads may not pass the criterion. To deal with 
this event, we state the following lemma. 

Lemma 17 (Mis-detection). Let X and Y be two distinct i-mers from the same physical 
location. The probability that Hi is accepted is bounded by 

f(^lp-eD{P^\\Px,Y)_ 

Proof. This is an immediate consequence of Theorem [T51 □ 

Proof of Theorems llOt Similar to the proof of achievability result in the noiseless 
case, we decompose the error event S into Si U S2 where Si is the event that some read is 
merged incorrectly and S2 is the event that the DNA sequence is not covered by the reads. 
The probability of the second event, similar to the noiseless case, goes to zero exponentially 
fast if _R > Z. We only need to compute P(£^i). Again, Si can be decomposed as Si = UiAi, 
where Ae is the event that the first error in merging occurs at stage i. Moreover, 

Ae^BiUCi, (35) 
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where: 

B^ 4 {R^. ^ R^^, Uj <i,Vi< i, Wij = £ for some i^j.} (36) 

Q A |j^^. _ Yil Uj = Vi^i, Wij = i for some i^j.} (37) 

Note that here the definition of Ce is different from that of ( TT51) as for the noiseless reads the 
overlap score is never less than the physical overlap. However, in the noisy reads there is a 
chance for observing this event due to mis-detection. 

The analysis of Be follows closely from that of the noiseless case. In fact, using Lemma 
[TBI which is a counterpart of Lemma [12] and following similar steps in calculation of F[Bi) in 
the noiseless case, one can obtain 

P(0£)</(^)(gf + 27ALg,), (38) 

where 

To compute fi^Ce), we note that C^ C UjQ, where 

Applying the union bound and considering the fact that Q's are equiprobable yields 

Ce < N¥{CJ). 
Hence, 

P(Q) < iV(P(iy(R„ R,^) >e\V, = i)+ nW{R,, R^) < i\V, = £))¥{V, < £). 
Using Lemma [TBI part 2 and Lemma [T71 yields 

where 

q'^ = _x(;;g-«£'(PMll^x,y)2-^(i-^)(i-i/Af)_ 
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Combining all the terms, we obtain 

i=o e=o 

To show that that P(£^i) — > 0, it is sufficient to argue that go, Qo, Ql, and g^ go to 
zero exponentially in L. Considering ffist go and q'^, they vanish exponentially in L if 
A^ > G\nG/L which implies Cmin(L) = 1. The terms g^ and g^ vanish exponentially in 
Lif 

^ mm{2D{P^\\Px,Y),D{P^\\Px ■ Py))' 
Since P(£^i) = o(l) and ¥{82) = o(l) for any choice of 6, one can optimize over 6 to obtain 

the result given in the theorem. This completes the proof. 
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